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43 ABSTRACT 

44 

45 Monoallelic gene expression is typically initiated early in the development of an organism. 

46 Dysregulation of monoallelic gene expression has already been linked to several non- 
47 Mendelian inherited genetic disorders. In humans, DNA-methylation is deemed to be an 

48 important regulator of monoallelic gene expression, but only few examples are known. One 

49 important reason is that current, cost-affordable truly genome-wide methods to assess DNA- 

50 methylation are based on sequencing post enrichment. Here, we present a new methodology 

51 that combines methylomic data from MethylCap-seq with associated SNP profiles to identify 

52 monoallelically methylated loci. Using the Hardy-Weinberg theorem for each SNP locus, it 

53 could be established whether the observed frequency of samples featured by biallelic 

54 methylation was lower than randomly expected. Applied on 334 MethylCap-seq samples of 

55 very diverse origin, this resulted in the identification of 80 genomic regions featured by 

56 monoallelic DNA-methylation. Of these 80 loci, 49 are located in genie regions of which 25 

57 have already been linked to imprinting. Further analysis revealed statistically significant 

58 enrichment of these loci in promoter regions, further establishing the relevance and 

59 usefulness of the method. Additional validation of the found loci was done using 14 whole- 

60 genome bisulfite sequencing data sets. Importantly, the developed approach can be easily 

61 applied to other enrichment-based sequencing technologies, such as the ChlP-seq-based 

62 identification of monoallelic histone modifications. 

63 
64 
65 
66 
67 
68 
69 

70 
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71 INTRODUCTION 

72 

73 For diploid organisms, gene expression is denoted as monoallelic if only one allele is 

74 transcriptionally active. The expressed allele can be randomly selected (e.g. X-chromosome 

75 inactivation and some autosomal genes) or predetermined by parental imprinting (Gimelbrant 

76 et al. 2007; Ferguson-Smith 2011; Fedoriw et al. 2012). Erroneous monoallelic expression 

77 has been associated to several genetic disorders, like the Prader-Willi syndrome, as well as 

78 to certain forms of cancer, like Wilms' tumor. Both diseases are caused by loss of imprinting 

79 of some genes in the 1 5q1 1 -q1 3 and 1 1 p1 5.5 region, respectively (Egger et al. 2004). 

80 Epigenetics is defined as the study of inheritable modifications on both chromatin and DNA 

81 that have an influence on gene expression without changing the underlying DNA sequence 

82 (Goldberg et al. 2007). Mammalian DNA-methylation is an epigenetic mark that is 

83 predominantly found in a CpG sequence context (Law et al. 2010). This methylation mark 

84 has been linked with gene expression and when located in the promoter region, it generally 

85 leads to transcriptional silencing of the corresponding gene (Jones 2012). As it is a defining 

86 feature of cellular identity and essential for normal development, its dysregulation is often 

87 associated with disease (Egger et al. 2004). Monoallelic DNA-methylation is likely to bare an 

88 important role in the regulation of monoallelic expression (Milani et al. 2009). In addition to 

89 DNA-methylation, histone modifications also contribute to the maintenance of monoallelic 

90 expression. The methylated, silenced allele is mostly sustained with the repressive histone 

91 modification histone H3 trimethylation at lysine 9 (H3K9me3) while the active allele is 

92 characterized by the permissive histone marker H3 trimethylation at lysine 4 (H3K4me3) 

93 (Kacem and Feil 2009). 

94 Although imprinting is a well-investigated topic and several studies already provided 

95 evidence (e.g. computational predictions based on DNA sequence characteristics or 

96 detection of monoallelic expression) of some regions with monoallelic DNA-methylation 

97 (Luedi et al. 2005, Luedi et al. 2007, Babak et al. 2008, Serre et al. 2008, Wang et al. 2008, 
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98 Morison et al. 2001; Zhang et al. 2010; Ferguson-Smith 2011; Fedoriw et al. 2012), only a 

99 few imprinted regions are well characterized in humans, like for example the IGF2/H19 

100 region. Furthermore, while monoallelic methylation has been shown to play an important role 

101 in the differentiation between tissues, little is known about the presence of loci that are stably 

102 monoallelically methylated in all tissues as well as the genome-wide character of monoallelic 

103 DNA-methylation. The recent advent of next-generation massively parallel sequencing 

104 platforms has introduced the possibility of genome-wide DNA-methylation profiling. Bisulfite 

105 sequencing, which combines bisulfite treatment of genomic DNA with the high-throughput 

106 sequencing of the entire genome, is the gold standard and allows to readily identify 

107 monoallelic methylated alleles (Shoemaker et al. 2010), but is very costly and therefore 

108 outside the reach of smaller projects. Fortunately, cost-effective alternatives based on the 

109 specific enrichment of methylated portions of the genome (i.e. enrichment-based methods) 

110 such as methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq) and 

111 methyl-CpG binding domain protein sequencing (MethylCap-seq) exist. Yet, these methods 

112 do neither provide single base pair resolution nor information regarding unmethylated alleles 

113 and are therefore not directly applicable to detect monoallelic events (Serre et al. 2010). 

114 While some approaches already tried to tackle this issue, they rely on the combination of 

115 multiple sequencing technologies, like for example the integrative method of Harris et al. 

116 (2010), which tries to find regions with intermediate and potentially monoallelic events by 

117 combining data originating from MeDIP-seq, methylation-sensitive restriction enzyme 

118 sequencing (MRE-seq), ribonucleic acid sequencing (RNA-seq) and chromatin 

119 immunoprecipitation followed by sequencing (ChlP-seq). 
120 

121 To circumvent these issues, we developed a data-analytical framework that solely uses data 

122 from enrichment-based sequencing (like MethylCap-seq), which screens for regions that 

123 exhibit monoallelic DNA-methylation based on classical population genetic theory. The 

124 developed pipeline first compares enrichment-based sequencing data of multiple samples to 

125 the public NCBI Single Nucleotide Polymorphism (SNP)-archive (dbSNP) in order to screen 
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126 the obtained non-duplicate, uniquely mappable sequence reads for SNPs. Only SNP loci with 

127 an adequately coverage and allele frequency are retained and the effect of sequencing 

128 errors is further reduced by comparing the chance of a sequencing error with the chance of 

129 detecting genuine SNPs. For each single SNP locus, the Hardy-Weinberg theorem is then 

130 applied to evaluate whether the observed frequency of samples featured by a biallelic event 

131 is lower than randomly expected (Mayo 2008). Using a permutation approach, confidence 

132 limits are simulated and genomic regions with a p-value smaller than the p-value 

133 corresponding with a given false discovery rate (FDR) can be assumed to harbour a 

134 monoallelic event. 
135 

136 Starting from MethylCap-seq data of a mixture of 334 Caucasian human samples and an 

137 FDR of 0.1, this methodology allowed the identification of 80 monoallelically methylated loci, 

138 for which significant enrichment was found in promoter regions. Of these 80 loci, 25 have 

139 previously been linked to imprinting. Additional validation of 44 of the found loci was done 

140 using 14 whole-genome bisulfite sequencing (WGBS) data sets. 29 loci (65.9%) showed 

141 evidence of monoallelic methylation in at least one of these samples. This outcome as well 

142 as the developed approach are innovative and provide new insights. Because a variety of 

143 tissues was used in the analysis, the detected loci are likely generally imprinted in a large set 

144 of tissues. Furthermore, our method allows the identification of monoallelically methylated 

145 loci in a parental-independent and genome-wide manner. Finally, because of the general 

146 rationale of the developed approach, it can be applied to enrichment-based sequencing 

147 applications to detect monoallelic features other than DNA-methylation. A possible 

148 application could be ChlP-seq (Furey 2012) to screen for monoallelic histone modifications 

149 (Kadota et al. 2007, Maynard et al. 2008, Birney et al. 2010, Kasowski et al. 2010, McDaniell 

150 etal. 2010). 
151 
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152 METHODS 

153 

154 1. Samples 

155 A total of 334 human samples, mostly cancer samples of various tissues, was used to detect 

156 monoallelically methylated loci (Supplemental Table 1.1). Of these 334 samples, 215 

157 samples were of female origin and only these were used to analyse the X-chromosome. 

158 Genomic DNA (gDNA) was extracted from these samples with the Easy DNA kit (Invitrogen 

159 K1 800-01) using protocol #4 from the manufacturer manual. The DNA-concentration was 

160 measured on a Nanodrop ND-1000. Subsequently, the gDNA was sheared on the Covaris 

161 S2 with following settings: duty cycle 10%, intensity 5, 200 cycles per burst during 180 

162 seconds to obtain fragments with an average length of 200 bp. The power mode was 

163 frequency sweeping, temperature 6-8 °C and water level of 12. 500 ng was loaded in 130 pi 

164 TE (1 :5) in a microtube with AFA intensifier. 
165 

166 2. Methyl-CpG binding domain sequencing 

167 Methyl-CpG binding domain protein sequencing (MethylCap-seq) (Serre et al. 2010), which 

168 combines enrichment of methylated DNA-fragments by MBD-based affinity purification with 

169 massively parallel sequencing, was used to profile the DNA-methylation pattern of the 334 

170 samples. The samples were sequenced according to the protocol described in the paper of 

171 De Meyer et al. (2013) with some additional modifications: i) After DNA-fragmentation, the 

172 methylated fragments were captured using Diagenode's MethylCap kit starting from a DNA- 

173 concentration of 500 ng instead of 200 ng. ii) Paired-end sequencing was done on either the 

174 lllumina GAIIx or the HiSeq platform. Depending on the sequencing platform, the obtained 

175 paired-end sequence reads were 45 or 51 bp, respectively. 
176 

177 3. Data pre-processing 

178 The rationale behind the proposed methodology is that biallelic DNA-methylation results in 
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179 MethylCap-seq data which is in Hardy-Weinberg equilibrium for each locus, i.e. if SNPs are 

180 present for a locus, both homozygous and heterozygous subjects will be detected at a 

181 predictable rate (Mayo 2008). However, in case of monoallelic methylation, heterozygous 

182 samples will no longer be detected resulting in deviation from the Hardy-Weinberg 

183 equilibrium, which can be measured. For a detailed description of the statistical framework, 

184 see the Supplemental Methods. Figure 1 gives an overall representation of the workflow 

185 starting from MethylCap-seq data. 
186 

187 3.1 Mapping 

188 For each of the 334 samples, the MethylCap-seq paired-end reads were mapped using 

189 BOWTIE (Langmead et al. 2009). The mapping parameters were chosen so that only those 

190 paired-end reads that mapped uniquely on the human hg19/GRCh37 reference assembly 

191 within a maximum of 400 bp of each other were retained. In order to both reduce the 

192 presence of sequencing errors as well as to allow the occurrence of real SNPs, a maximum 

193 of only three mismatches was allowed. Duplicate fragments, i.e. fragments with the exact 

194 same location of both paired-end reads, were disposed as these are most likely the result of 

195 amplification of the same sequence reads during the library preparation. 

196 

197 3.2 SNP tracing 

198 The non-duplicate, uniquely mappable reads were subsequently screened for SNPs. Only 

199 positions that showed a mismatch in the mapping of one or more samples and that 

200 overlapped known single nucleotide variations (SNVs) of the Single Nucleotide 

201 Polymorphism Database of NCBI (dbSNP, version 137) were withheld. Not keeping all the 

202 mismatches reduces both the effect of sequencing errors (false positives) and the 

203 computational load in the further analyses. Also, for each locus, the coverage of each SNP 

204 variant was determined, and the allele frequencies were estimated. 
205 
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206 3.3 Additional data filtering and correction 

207 Both for computational reasons and as a first filtering step for sequencing errors, SNP loci 

208 with a very high major allele frequency were filtered (threshold 0.9). Additionally, a minimal 

209 total coverage threshold, i.e. across all samples, for each SNP locus was imposed (350 ~ 1x 

210 per sample). Note that loci not fulfilling both criteria are unlikely to provide sufficient power for 

211 the subsequent statistical analysis. As analysis of the X-chromosome involved fewer 

212 samples, the threshold for the coverage was set to a less stringent value, namely 250 

213 instead of 350, which roughly corresponded to the number of female subjects. 
214 

215 In this reduced data set, an additional sequencing error correction was performed. For 

216 computational reasons, a simple Bayesian methodology was implemented. Basically, for 

217 each sample and locus, the chance of obtaining a certain profile was calculated under i) the 

218 assumption of heterozygosity and, ii) the assumption of homozygosity but with additional 

219 sequencing errors. The option with the largest a posteriori change was withheld (with alleles 

220 representing putative sequencing errors being removed from the data set). As the prior 

221 chances of homozygosity and heterozygosity were based on the allele frequencies, which 

222 are updated upon each round of the Bayesian algorithm, this method was performed twice 

223 (see Supplemental Methods section 2.1.3). This approach can be considered to be 

224 conservative (i.e. to disfavour the presence of monoallelic DNA-methylation), as i) only two 

225 rounds of correction were applied and, ii) the sequencing error estimate (0.25%, based on 

226 Quail et al. 2012) is on the lower bound of estimates reported and is based on the 

227 performance of the lllumina HiSeq, whereas also more error prone GAIIx data were included 

228 in this study. 
229 

230 4. Detection of monoallelically methylated loci 

231 After additional filtering and data correction, the remaining data were used as input of the 

232 new data-analytical framework developed in the R statistical environment (R 2.15.2). The 

233 statistical strategy and practical implementation are elaborated in the Supplemental Methods 
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234 (section 2.1). In summary, based on the observed allele frequencies, theoretically expected 

235 genotype frequencies can be calculated assuming Hardy-Weinberg equilibrium in the overall 

236 data set. If the observed frequency of heterozygote individuals is significantly reduced 

237 relative to Hardy-Weinberg expectations, this indicates significant monoallelic methylation. 

238 Null distributions were generated using random data with the same allele frequencies and 

239 sample coverages (for that locus) as in the original data. This approach accounts for the 

240 increased likelihood of erroneously calling loci with a low coverage homozygous. P-values 

241 were determined by comparison of the observed frequency of heterozygotes with the 

242 generated null distributions. Only loci that obtained a p-value smaller than or equal to 0.005 

243 after the first iteration were kept as input for the second iteration. Thus, after the first iteration 

244 round, loci that were in all probability not monoallelically methylated, were filtered out as to 

245 reduce the computational time in the second iteration. At the end of the second iteration the 

246 algorithm obtained a p-value for each locus. If this p-value was smaller than the p-value 

247 corresponding with an FDR of 0.1, monoallelic methylation on this locus was called 

248 significant. This procedure was also performed two times, a first time with 1,000 and a 

249 second time with 1 ,000,000 iterations. To summarize results, significant loci were visualised 

250 on a circular plot with the Circos tool (Krzywinski et al. 2009). 
251 

252 5. Functional annotation and enrichment analysis 

253 Successful completion of the monoallelically methylated loci detection pipeline resulted in a 

254 list of significant SNPs. The functional annotation (i.e. promoter, exon, intron and intergenic) 

255 of these SNP positions was determined using Ensembl (release 66), wherein the promoter 

256 was defined as starting from 2000 bp upstream until the transcriptional start site. 
257 

258 We tested for enrichment in one or more of these functional categories. A null distribution 

259 was generated by random sampling from the total amount of detected SNPs (after filtering as 

260 specified in Methods section 3.3) and counting the occurrences of the respective 

261 annotations. During this sampling procedure, the number of SNPs sampled for each 
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262 chromosome was equal to the number of significant SNPs on that chromosome. This 

263 sampling was repeated 1,000 times. With the null distribution obtained for each of these 

264 functional locations (i.e. promoter, exon, intron and intergenic), it was possible to calculate a 

265 two-sided p-value for each functional location. For loci that were featured by more than one 

266 functional annotation (i.e. overlapping genes and/or different transcripts and/or sense and 

267 antisense strand) the score for the functional location was divided by the amount of different 

268 functional locations that this locus has (the sum always being one). For example, if a locus is 

269 located in an exon on the sense strand but is also located in an intron on the other strand, 

270 both the exon and intron were attributed a score of 0.5. 
271 

272 6. Validation of putative loci using 14 whole-genome bisulfite sequencing data sets 

273 

274 In order to evaluate the loci detected by this novel methodology, an extra validation step was 

275 performed using 14 publicly available WGBS data sets comprising a range of tissue types. 

276 The WGBS data sets were downloaded from the Gene Expression Omnibus (GEO) 

277 repository (Edgar et al. 2002). A summary of the data sets including accession numbers is 

278 provided in Supplemental Table 1.3. The 14 samples were aligned in a window of 2000 bp 

279 (1000 bp upstream and 1000 bp downstream) around the candidate SNP positions 

280 (hg19/GRCh37 reference assembly) using BISMARK (Krueger and Andrews 2011). After 

281 excluding duplicates, only reads mapping onto one of the candidate monoallelically SNP 

282 positions were kept. Next, for each SNP position and each sample the methylation calls of all 

283 CpGs were summarized per SNP allele (covered by the reads on the specific SNP position). 

284 To assess monoallelic DNA-methylation in the SNP loci a Pearson chi-square test was 

285 performed. Samples that were not covered or were homozygous for the particular locus were 

286 excluded. In summary, for each heterozygous sample a chi-square value was calculated 

287 based on the degree of (non-)methylation obtained for each SNP allele, with a high chi- 

288 square value indicating that the methylation degree is allele dependent (i.e. monoallelic 

289 methylation). Null distributions were made by a permutation approach (using the chisq.test 
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290 function of the R Stats package) generating 2000 random chi-square values for each sample, 

291 making it possible to determine a sample-specific p-value for each SNP-loci. By summing the 

292 chi-square values over all heterozygous samples for a specific SNP locus and again 

293 generating null distributions of random chi-square values, also a global p-value for a SNP 

294 locus could be obtained. Note that this test does not require absolute absence of methylation 

295 of one allele, which would be too strict given the possibility of incomplete bisulfite conversion 

296 and the presence of both sequencing errors and sequencing bias. 
297 

298 RESULTS 

299 

300 Mapping 

301 For the 334 samples the mapping resulted in 2,995,375,490 uniquely mapped reads and an 

302 average mapping percentage of 63.05% (Supplemental Table 1.1). After removing the 

303 duplicate fragments a total of 2,688,409,588 non-duplicate, uniquely mapped reads was 

304 acquired. 
305 

306 SNP tracing and data filtering 

307 After parsing the mapping output for SNPs (= mapping mismatches), 19,850,891 SNPs 

308 overlapped with already known SNV positions from dbSNP. These 19,850,891 loci represent 

309 41 .61% of the total number of SNV present in dbSNP and only these SNPs were used in the 

310 remainder of the analysis. Supplemental Table 1.2 details the number of SNPs that 

311 overlapped with dbSNP per chromosome. 
312 

313 After preprocessing the data, the corresponding coverage and allele frequencies were 

314 calculated for each of the 19,850,891 loci and subsequently used to filter the data. Only 

315 positions with a frequency of the major allele smaller than 0.90 and coverage larger than or 

316 equal to 350 (250 for chromosome X) were retained. 486,090 out of 19,850,891 loci (2.45%) 

317 complied with these thresholds. Supplemental Table 1 .2 shows the number of SNP positions 
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318 that were retained after filtering as well as the fraction per chromosome. 
319 

320 Detection of monoallelically methylated loci 

321 Likely sequencing errors in the list of filtered loci were adjusted (see Methods section 3.3 and 

322 Supplemental Methods section 2.1 .3). Corrected data (available as Supplemental Data) were 

323 subsequently analysed using the developed statistical methodology. If the p-value obtained 

324 for a locus was smaller than the p-value corresponding with an FDR of 0.1 (p-value = 

325 0.000016), the monoallelic methylation on this locus was called significant. This was true for 

326 80 loci (see Table 1). Figure 2 depicts the genomic distribution of these 80 monoallelically 

327 methylated loci. 
328 

329 In a next step, the functional location of the 80 loci with significant monoallelic DNA- 

330 methylation was determined. These results are shown in Table 2. Table 3 provides an 

331 overview of the genes in which a significant SNP position was found. Thus, of the 80 

332 detected loci, 49 are located in a genie region (i.e. promoter, exon, intron) of which 25 are 

333 located in regions with (some) evidence, i.e. monoallelic expression, of imprinting. (Morison 

334 et al. 2001 ; Zhang et al. 2010, http://www.geneimprint.com). 
335 

336 Functional enrichment of loci with significant monoallelic DNA-methylation 

337 Figure 3(a) represents the relative number of the different functional annotations of these 80 

338 loci. No significant enrichment was found when genie regions were compared to intergenic 

339 regions (data not shown). The majority of the significant SNP positions are located in intronic 

340 (43.33%) and intergenic regions (37.5%). Additionally, a significant number was found in the 

341 promoter regions (13.96%). A minority of 5.21% mapped to exonic regions. In order to 

342 investigate whether one of these functional locations was under- or overrepresented 

343 compared to random data, we also performed an enrichment analysis. Figure 3(b) shows the 

344 mean classification of SNPs after 1,000 random samplings. By comparing the outcome of 

345 this random sampling with the functional locations of the 80 significant loci (see Methods 
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346 section 5), the analysis indicated a significant enrichment in promoter methylation (p = 

347 0.002), but not in other functional locations. 
348 

349 Validation of putative loci using 14 whole-genome bisulfite sequencing data sets 

350 After preprocessing the 14 WGBS data sets as outlined in Methods section 6, 44 out of the 

351 80 significant were covered by at least one heterozygous sample. Table 4 summarizes both 

352 the global and the sample-specific p-values obtained for each of these 44 loci. 29 loci 

353 (65.9%) had a global p-value lower than 0.05 of which 24 (54.5%) even had a global p-value 

354 virtually equal to 0 suggesting monoallelic methylation in at least one of the 14 samples. 
355 

356 

357 DISCUSSION 

358 

359 Monoallelic gene expression is typically initiated early in the development of an organism and 

360 stably maintained. Erroneous monoallelic expression has been related to several non- 
361 Mendelian inherited genetic disorders. DNA-methylation plays a significant role in the 

362 regulation of monoallelic expression. The choice of which allele will be monoallelically 

363 expressed can be either random or a priori defined by imprinting. Here we introduced a 

364 methodology to screen for genes that exhibit monoallelic DNA-methylation and thus might 

365 regulate monoallelic expression. 
366 

367 Using MethylCap-seq, methylome profiles of 334 samples, mostly human cancer samples of 

368 diverse origin, were obtained. In summary, upon extra filtering and data correction, for each 

369 SNP locus the Hardy-Weinberg theorem was applied to evaluate whether the observed 

370 frequency of samples featured by biallelic methylation is lower than randomly expected. 

371 Using a permutation approach, loci with a p-value smaller than the p-value corresponding 

372 with a selected FDR of 0.1 were assumed to be monoallelically methylated. Finally, this 

373 resulted in the identification of 80 loci that showed significant monoallelic DNA-methylation. 
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374 

375 Functional location of these monoallelic events might provide deeper insight in the unraveling 

376 of monoallelic mechanisms and are provided in Tables 2 and 3. It is common that imprinted 

377 genes are present within clusters and share common regulatory elements, such as non- 
378 coding RNAs and differentially methylated regions (DMRs). If these DMRs control the 

379 imprinting of one or more genes, these regions are called imprinting control regions (ICRs). It 

380 is known that many of these ICRs are located in intergenic regions. As some of the found loci 

381 are located in intergenic regions as well as in known (long) non-coding RNAs (IncRNAs) (see 

382 Tables 2 and 3, respectively), it is possible that these regions present new regulatory 

383 elements involved in imprinting. Furthermore, when we take a closer look at the intergenic 

384 regions, the SNP on chromosome 2 with position 207,122,438 also shows significant 

385 monoallelic methylation. This is interesting, because this locus falls in GPR1AS, a recently 

386 found imprinted IncRNA in the GPR1-ZDBF2 intergenic region (Kobayashi et al. 2013), 

387 corroborating the outcome of this study and indicating that the intergenic regions are also of 

388 interest for further analyses. Of the 80 loci, 49 were located in genie regions of which 25 are 

389 already linked to imprinting, further strengthening our results. For example, on chromosome 

390 1 1 (2.01 -2.03 Mb), the IGF2/H19 region was highlighted with 6 SNPs (Figure 2). This locus is 

391 a well known imprinted region that's also linked to the Beckwith-Wiedemann syndrome and 

392 Wilms' tumor (Steenman et al. 1994; Egger et al. 2004; Robertson 2005; Hubertus et al. 

393 2011). The H19 gene codes for a IncRNA of which expression is negatively correlated with 

394 the expression of the neighbouring gene insulin-like growth factor 2 (IGF2). Usually the 

395 paternal copy of H19 is methylated and silent, while the maternal copy is hypo- or 

396 unmethylated and expressed. The same is true for the imprinted region on chromosome 15 

397 that is correlated to the Prader-Willi syndrome (20.7-30.3 Mb) (Egger et al. 2004). In SNRPN, 

398 one of the genes in this region where loss of imprinting is linked to the Prader-Willi 

399 syndrome, 2 significant SNPs were identified. For a couple of genes (or regions), like for 

400 example H19, more than one significant SNP locus was found. Because some of these 

401 SNPs are in a distance of more than 400 bp (the cut-off length of sequence reads during 
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402 mapping) of each other, these prove independently the presence of monoallelic DNA- 

403 methylation in that particular region. These SNPs thus provide 'multiple proof in the 

404 identification of the particular monoallelically methylated region and lend added value to the 

405 results. Not unexpectedly, Figure 3 and the enrichment analysis clearly demonstrated 

406 enrichment for monoallelic methylation in promoter regions, although it should be noted that 

407 this enrichment is rather limited in absolute number. 
408 

409 Further validation was performed using 14 publicly available WGBS data sets. Of the 80 

410 significant loci, 44 were covered by a heterozygous sample and could thus be further 

411 examined. 29 of the 44 loci (65.9%) obtained a global p-value lower than 0.05 of which 24 

412 (54.5%) a global p-value of virtually 0, indicating monoallelic methylation in one or more 

413 samples. For only 9 of these 24 SNP loci evidence of imprinting already exists, so that with 

414 this subset of 14 WGBS samples at least 15 new monoallelically methylated regions, found 

415 with our new data-analytical framework, are validated. 
416 

417 There are a couple of important remarks that come with the proposed methodology: i) The 

418 basic assumption that MethylCap-seq data from biallelically methylated loci are generally in 

419 Hardy-Weinberg equilibrium only holds for samples originating from a panmictic population 

420 (i.e. a single population that is long-term randomly mating). If this is not the case and the 

421 samples are not panmictic, this could possibly give rise to some false positives. Thus, for 

422 samples that slightly deviate from the assumption of panmixia, an extra validation of the 

423 resulting loci is necessary to assure qualitative results (as was done in this study), ii) The 

424 approach doesn't take into account that loci with monoallelic methylation will be picked up 

425 less efficiently than biallelic loci resulting in less power leading to a less efficient detection of 

426 monoallelically methylated loci. By consequence, the methodology is less sensitive and thus 

427 too conservative, though this has no effect on the reliability of those results deemed 

428 significant, iii) To eliminate sequencing errors as well as to reduce the computational time 

429 and effort a filtering step was performed. Consequently, some data will not be analysed and 
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430 this could interfere with the detection of monoallelic DNA-methylation. However, the benefits 

431 of filtering outweigh the possible drawbacks: the computational load reduces significantly and 

432 loci that do not pass the filter cut-off will have insufficient power to be detected, iv) The 

433 approach used to correct for possible sequencing errors disfavors the presence of 

434 monoallelic DNA-methylation: only two correction rounds were performed and the 

435 sequencing error estimate of 0.25% is the lowest estimate reported (Quail et al. 2012). But 

436 although the correction method can be considered a bit too stringent, it will assure a better 

437 quality of the obtained results and will not give rise to more false positives. In fact, it will 

438 possibly reduce the amount of false negatives and thus allows a more sensitive identification 

439 of monoallelically methylated loci that would otherwise not have been detected, v) The 

440 analysis was performed with samples originating from different tissues that were mostly 

441 cancer tumors. The fact that tumors are epigenetically less stable (Varley et al. 2013), makes 

442 it probably more difficult to detect monoallelic methylation. 

443 Although we opted to use a stringent approach, the outcome clearly demonstrates that our 

444 methodology is still sensitive enough and produces satisfying results of high quality. Because 

445 the methodology used a mixture of samples originating from different tissues and demanded 

446 a high overall coverage, the identified loci are presumably monoallelically methylated in a 

447 variety of cell types, underlining their importance and biological relevance. In conclusion, the 

448 obtained results prove that the proposed methodology is effective. In the future, it would also 

449 be very informative to repeat the analysis on samples that are of normal origin and in extent 

450 from the same tissue. The latter would be very valuable in the study of tissue-specific 

451 monoallelic DNA-methylation, expression and thus tissue-specific cell differentiation. Next to 

452 MethylCap-seq, our approach also opens the door to other applications, like ChlP-seq-based 

453 detection of monoallelic protein-DNA binding events and histone modifications. 
454 

455 DATA ACCESS 

456 

457 The filtered and corrected SNP data used as starting data for our developed methodology is 

458 made available as Supplemental Data. 
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470 FIGURE LEGENDS 

471 

472 Figure 1: 

473 

474 Overview of the developed bioinformatics pipeline to detect putative monoallelically 

475 methylated SNP loci starting from MethylCap-seq data. After mapping with BOWTIE the non- 
476 duplicate, uniquely mapped reads are screened for SNPs using dbSNP. To reduce the 

477 computational load SNP loci with a too high major allele frequency (MAF) and/or a too low 

478 overall coverage are filtered. In this reduced data set, an additional sequencing error 

479 correction was performed with two iterations. The corrected data was next put in the newly 

480 developed data-analytical framework with 1 ,000 and 1 ,000,000 iterations, respectively. Only 

481 loci that obtained a p-value smaller than or equal to 0.005 after the first iteration were kept as 

482 input for the second iteration. If the p-value obtained for a locus was smaller than the p-value 

483 corresponding with an FDR of 0.1 the monoallelic methylation on this locus was called 

484 significant. After determining the functional annotation of these SNP positions an enrichment 

485 analysis was performed. Finally, the resulting loci were validated using both literature and 

486 whole-genome bisulfite sequencing (WGBS) data. 
487 

488 Figure 2: 

489 Circular representation of the genomic distribution of the 80 loci for which the monoallelic 
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490 DNA-methylation was called significant. Chromosomes are divided in regions of 5,000,000 

491 bp. The inner circle shows the histogram of all SNPs found in a specific region, whereas the 

492 outer circle shows the histograms of the significant SNPs in that same region, normalized to 

493 the number of SNPs found in that region. 
494 

495 Figure 3: 

496 Pie charts representing the relative number of significant SNPs in the different functional 

497 classes (i.e. promoter, exon, intron and intergenic). (a) Functional classification of the 

498 significant SNPs (i.e. loci with significant monoallelic DNA-methylation). (b) Functional 

499 classification of random SNPs resulting from 1 ,000 iterations. 
500 

501 
502 
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513 

514 Figure 3(a): 

515 

significant SNPs 
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518 

519 Figure 3(b): 

520 

random SNPs 
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523 TABLES 

524 

525 Table 1: Monoallelic DNA-methylation per chromosome. The first two columns show the specific chromosome 

526 (Chr) and the number of input entries for the statistical analysis. The third and fourth columns show the amount of 

527 loci, which obtained a p-value smaller than (or equal to) 0.005 (after first iteration) and 0.000016 (after second 

528 iteration, corresponding with FDR = 0.1), respectively. 
529 



Chr 


Input entries 


p < 0.005 


p< 0.000016 


1 


37,259 


98 


3 


2 


34,184 


113 


10 


3 


22,065 


73 


4 


4 


26,442 


68 


3 


5 


20,500 


54 


1 


6 


25,708 


112 


3 


7 


31,450 


153 


8 


8 


20,868 


63 


0 


9 


19,908 


78 


0 


10 


30,138 


108 


7 


11 


19,255 


82 


8 


12 


20,484 


51 


0 


13 


11,709 


88 


3 


14 


12,297 


40 


1 


15 


11,829 


47 


2 


16 


29,432 


64 


4 


17 


24,824 


158 


3 


18 


11,783 


29 


1 


19 


25,066 


96 


8 


20 


16,201 


45 


7 


21 


11,977 


40 


2 


22 


15,012 


70 


2 


X 


7699 


27 


0 


TOTAL 


486,090 


1757 





530 
531 
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532 Table 2: Functional annotation of the 80 loci with significant monoallelic DNA-methylation. Next to the functional 

533 annotation, the corresponding (Ensembl) gene ID is shown (GenelD). 
534 



Chr 


SNP 
position 


Functional 
annotation 


GenelD 




Chr 


SNP 
position 


Functional 
annotation 


GenelD 


1 


90844629 


INTERGENIC 






11 


2721568 


INTRON 


ENSG00000053918 


1 


90844662 


INTERGENIC 






11 


51579458 


INTERGENIC 




1 


228757936 


PROMOTER 


ENSG00000200624 




13 


31481030 


INTRON 


ENSG000001 02802 


1 


228757936 


INTRON 


ENSG000001 83929 




13 


85969909 


INTRON 


ENSG00000226317 


2 


69347244 


INTRON 


ENSG000001 69604 




13 


85969941 


INTRON 


ENSG00000226317 


2 


133018988 


PROMOTER 


ENSG00000233786 




14 


88237822 


INTRON 


ENSG00000258807 


2 


133020085 


EXON 


ENSG00000233786 




15 


25123472 


INTRON 


ENSG000001 28739 


2 


133023691 


INTERGENIC 






15 


25201659 


INTRON 


ENSG000001 28739 


2 


133023703 


INTERGENIC 






15 


25201659 


INTRON 


ENSG00000214265 


2 


133023711 


INTERGENIC 






16 


3493495 


EXON 


ENSG000001 67981 


2 


133029769 


INTERGENIC 






16 


3493495 


PROMOTER 


ENSG000001 22390 


2 


133032580 


INTERGENIC 






16 


11415785 


INTRON 


ENSG000001 75643 


2 


133033524 


INTERGENIC 






16 


33960762 


PROMOTER 


ENSG00000207986 


2 


2071 22438 


INTERGENIC 






16 


46411729 


INTERGENIC 




3 


5156139 


INTERGENIC 






17 


22252007 


INTERGENIC 




3 


5156149 


INTERGENIC 






17 


22259640 


INTERGENIC 




3 


162561619 


INTERGENIC 






17 


31 340444 


EXON 


ENSG000001 08684 


3 


196722009 


INTRON 


ENSG000001 63975 




18 


18517029 


INTERGENIC 




4 


7635629 


INTRON 


ENSG000001 84985 




19 


15279411 


INTRON 


ENSG00000074181 


4 


49099668 


INTERGENIC 






19 


24184564 


PROMOTER 


ENSG00000251948 


4 


89618837 


INTRON 


ENSG000001 38641 




19 


54040861 


PROMOTER 


ENSG000001 30844 


4 


89618837 


EXON 


ENSG000001 77432 




19 


54040861 


INTRON 


ENSG000001 30844 


5 


178650557 


INTRON 


ENSG00000087116 




19 


54040861 


EXON 


ENSG000001 30844 


6 


3849305 


PROMOTER 


ENSG000001 45945 




19 


54041242 


PROMOTER 


ENSG000001 30844 


C 

D 


QQ/i qqnR 
ootyouj 


IMTRHM 








JtUt 1 


IMTROM 




6 


168784228 


INTERGENIC 






19 


54057156 


INTRON 


ENSG000001 30844 


6 


170055316 


INTRON 


ENSG000001 84465 




19 


54057156 


PROMOTER 


ENSG000001 30844 


7 


19534519 


INTRON 


ENSG00000223838 




19 


54057515 


INTRON 


ENSG000001 30844 


7 


56437045 


INTERGENIC 






19 


54057515 


PROMOTER 


ENSG000001 30844 
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7 


57554497 


INTERGENIC 






19 


54057777 


INTRON 


ENSG000001 30844 


7 


61080848 


INTERGENIC 






19 


54057777 


PROMOTER 


ENSG000001 30844 


7 


64895556 


INTERGENIC 






19 


57350463 


INTRON 


ENSG000001 98300 


7 


157923845 


INTRON 


ENGS000001 55093 




19 


57350463 


INTRON 


ENSG00000259486 


7 


158041458 


INTRON 


ENSG000001 55093 




20 


55658479 


INTERGENIC 




7 


158041459 


INTRON 


ENSG000001 55093 




20 


55658481 


INTERGENIC 




10 


39086215 


INTERGENIC 






20 


57415110 


INTRON 


ENSG00000235590 


10 


39086885 


INTERGENIC 






20 


57415110 


EXON 


ENSG00000235590 


10 


39086905 


INTERGENIC 






20 


57415110 


PROMOTER 


ENSG00000087460 


10 


39086907 


INTERGENIC 






20 


57415110 


EXON 


ENSG00000087460 


10 


42800026 


INTERGENIC 






20 


57426449 


PROMOTER 


ENSG00000235590 


10 


102279294 


INTRON 


ENSG00000075826 




20 


57426449 


PROMOTER 


ENSG00000087460 


10 


102279294 


INTRON 


ENSG00000255339 




20 


57426449 


INTRON 


ENSG00000087460 


10 


102279294 


INTRON 


ENSG00000166136 




20 


57426726 


PROMOTER 


ENSG00000235590 


10 


102279295 


INTRON 


ENSG00000075826 




20 


57426726 


PROMOTER 


ENSG00000087460 


10 


102279295 


INTRON 


ENSG00000255339 




20 


57426726 


INTRON 


ENSG00000087460 


10 


102279295 


INTRON 


ENSG00000166136 




20 


57427132 


PROMOTER 


ENSG00000235590 


11 


2019496 


PROMOTER 


ENSG000001 30600 




20 


574271 32 


PROMOTER 


ENSG00000087460 


11 


2019496 


INTRON 


ENSG000001 30600 




20 


574271 32 


INTRON 


ENSG00000087460 


11 


2019496 


PROMOTER 


ENSG00000211502 




20 


57431165 


INTRON 


ENSG00000087460 


11 


2019618 


PROMOTER 


ENSG000001 30600 




20 


57431165 


EXON 


ENSG00000087460 


11 


2019618 


INTRON 


ENSG000001 30600 




21 


40757887 


INTRON 


ENSG000001 60183 


11 


2019618 


PROMOTER 


ENSG00000211502 




21 


40757887 


INTRON 


ENSG000001 82093 


11 


2021164 


INTRON 


ENSG000001 30600 




21 


40757887 


PROMOTER 


ENSG000001 82093 


11 


2021206 


INTRON 


ENSG000001 30600 




21 


44011806 


INTRON 


ENSG000001 83486 


11 


2021980 


INTRON 


ENSG000001 30600 




22 


42078666 


PROMOTER 


ENSG000001 00138 


11 


2022023 


INTRON 


ENSG000001 30600 




22 


42078666 


INTRON 


ENSG000001 00138 


11 


2721568 


PROMOTER 


ENSG00000258492 




22 


49077801 


INTRON 


ENSG00000219438 
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Table 3: SNPs featured by monoallelic methylation located in a gene associated region with indication of Location (chromosome:location), (Ensembl) Gene ID, Gene symbol, 
Description and Biotype. Known imprinted genes are shown in blue, predicted imprinted genes are shown in red. 



GenelD 


Gene symbol 


Description 


Biotype 


Location 


ENSG000001 83929 


DUSP5P 


Dual specificity phosphatase 5 pseudogene 


Pseudogene 


1 :228757936 


ENSG00000200624 


RN5S6 


RNA, 5S ribosomal 6 


rRNA 


1 :228757936 


ENSG000001 69604 


ANTXR1 


Anthrax toxin receptor 1 


Protein coding 


2:69347244 


ENSG00000233786 


CDC27P1 


Cell division cycle 27 homolog (S.cerevisiae) pseudogene 1 


Pseudogene 


2:133018988,133020085 


ENSG000001 63975 


MFI2 


Antigen p97 (melanoma associated) 


Protein coding 


3:196722009 


ENSG000001 84985 


SORCS2 


Sortilin-related VPS1 0 domain containing receptor 2 


Protein coding 


4:7635629 


ENSG000001 38641 


HERC3 


Hect domain and RLD 3 


Protein coding 


4:89618837 


ENSG000001 77432 


NAP1L5 


Nucleosome assembly protein 1 -like 5 


Protein coding 


4:89618837 


ENSG00000087116 


ADAMTS2 


ADAM metallopeptidase with thrombospondin type 1 motif, 2 


Protein coding 


5:178650557 


ENSG000001 45945 


FAM50B 


Family with sequence similarity 50, member B 


Protein coding 


6:3849305 


ENSG000002381 58 


RP11-420L9.4.1 


Processed transcript 


Processed transcript 


6:3849305 


ENSG000001 84465 


WDR27 


WD repeat domain 27 


Protein coding 


6:170055316 


ENSG00000223838 


AC007091.1.1 


IncRNA 


IncRNA 


7:19534519 


ENSG000001 55093 


PTPRN2 


Protein tyrosine phosphatase, receptor type, N polypeptide 2 


Protein coding 


7:1 58041 459,1 58041 458,1 57923845 


ENSG00000075826 


SEC31 


SEC31 homolog B (S.cerevisiae) 


Protein coding 


10:102279295,102279294 


ENSG00000166136 


NDUFB8 


NADH dehydrogenase (ubiquinone) 1 beta subcomplex 8, 19kDa 


Protein coding 


10:102279295,102279294 


ENSG00000053918 


KCNQ1 


Potassium voltage-gated channel, KGT-like subfamily, member 1 


Protein coding 


11:2721568 


ENSG00000258492 


KCNQ10T1 


KCNQ1 opposite strand/antisense transcript 1 


Antisense 


11:2721568 


ENSG00000211502 


MIR675 


microRNA 675 


miRNA 


11:2019496,2019618 


ENSG000001 30600 


H19 


H19, imprinted maternally expressed transcript 


Processed transcript 


1 1 :2021 1 64,201 9496,201 961 8,2021 206,2021 980,2022023 


ENSG000001 02802 


C130RF33 


Chromosome 13 open reading frame 33 


Protein coding 


13:31481030 


ENSG00000226317 


LINC00351 


Long intergenic non-protein coding RNA 351 


IncRNA 


13:85969909,85969941 


ENSG00000258807 


RP11-1152H15.1.1 


IncRNA 


IncRNA 


14:88237822 


ENSG00000214265 


SNURF 


SNRPN upstream reading frame 


Protein coding 


15:25201659 


ENSG000001 28739 


SNRPN 


Small nuclear ribonucleoprotein polypeptide N 


Protein coding 


15:25201659,25123472 


ENSG000001 22390 


NAA60 


N(alpha)-acetyltransferase 60, NatF catalytic subunit 


Protein coding 


16:3493495 


ENSG000001 67981 


ZNF597 


Zinc finger protein 597 


Protein coding 


16:3493495 


ENSG000001 75643 


RMI2 


RecQ mediated genome instability 2, homolog (S.cerevisiae) 


Protein coding 


16:11415785 


ENSG00000207986 


AC1 36932.1 


miRNA ncRNA 


miRNA 


16:33960762 


ENSG000001 08684 


ACCN1 


Amiloride-sensitive cation channel 1, neuronal 


Protein coding 


17:31340444 


ENSG00000074181 


NOTCH3 


Notch 3 


Protein coding 


19:15279411 


ENSG00000251948 


AC092279. 1 


miRNA ncRNA 


miRNA 


19:24184564 


ENSG000001 98300 


PEG3/ZIM2 


Zinc finger, imprinted 2 


Protein coding 


19:57350463 


ENSG00000259486 


ZIM2. 1 


Zinc finger, imprinted 2 


Protein coding 


19:57350463 


ENSG000001 30844 


ZNF331 


Zinc finger protein 331 


Protein coding 


1 9:5405751 5,54057777,54041 242,540571 56,54040861 


ENSG00000235590 


GNAS-AS1/SANG 


GNAS antisense RNA 1 


Antisense 


20:574271 32,5741 41 1 0,57426449,57426726 


ENSG00000087460 


GNAS 


GNAS complex locus 


Protein coding 


20:574271 32,5741 41 1 0,57426449,57426726,57431 1 65 


ENSG00000160183 


TMPRSS3 


Transmembrane protease, serine 3 


Protein coding 


21 :40757887 


ENSG000001 82093 


WRB 


Tryptophan rich basic protein 


Protein coding 


21 :40757887 


ENSG000001 83486 


MX2 


Myxovirus (influenza virus) resistance 2 (mouse) 


Protein coding 


21:44011806 


ENSG00000100138 


NHP2L1 


NHP2 non-histone chromosome protein 2-like 1 (S.cerevisiae) 


Protein coding 


22:42078666 


ENSG00000219438 


FAM19A5 


Family with sequence similarity 19, member A5 


Protein coding 


22:49077801 
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Table 4: Outcome of the additional validation of the putative loci with 14 whole-genome bisulfite sequencing (WGBS) data sets. Global and sample-specific p-values are shown 
for the 44 SNP loci (Chr:SNP location) that were covered by at least one heterozygous WGBS sample. Value '-' in the sample columns indicates that the sample did not cover 
or was not heterozygous for the corresponding SNP loci. Again, known imprinted regions are shown in blue, predicted imprinted regions are shown in red. 
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